This is a blog series about creating a web-based editor for JavaScript/Node.js and web development.
Read part 1, part 2, part 3, part 4, part 5 (Swedish), part 6, part 7, part 8 or part 9 (Swedish).

Better Unicode support

A few weeks back I got annoyed that the Unicode support in my editor was only half-baked. When inserting an emoji, the following character would sometimes sit on top of the emoji. The reason is that the glyph width (the width of the rendered character) of an emoji is wider than that of regular monospace letters, and the editor layout uses a very simple grid: an array of rows, each holding an array of columns. For example, grid[5][7] would be column 7 on row 5. Calculating where that character should be placed on the screen was straightforward: column * columnWidth.
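
As a rough sketch of that kind of fixed-width grid (not the editor's actual code; buildGrid, cellPosition, columnWidth and rowHeight are made-up names and values for illustration):

```javascript
// Hypothetical fixed-width layout: grid[row][col] holds one JavaScript "character".
const columnWidth = 8; // assumed width of one monospace glyph, in pixels
const rowHeight = 16;  // assumed line height, in pixels

// Naive grid: every UTF-16 code unit gets its own column.
function buildGrid(text) {
  return text.split("\n").map(line => line.split(""));
}

// Screen position of a cell, assuming every glyph is exactly one column wide.
function cellPosition(row, col) {
  return { x: col * columnWidth, y: row * rowHeight };
}

const grid = buildGrid("let a = 1;\nlet b = 2;");
console.log(grid[1][4]);         // "b"
console.log(cellPosition(1, 4)); // { x: 32, y: 16 }
```

The simplicity is the appeal: no measuring is needed, but it only holds as long as every glyph really is one column wide.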

So fixing the width of the emojis should be simple, right? Just add some space behind them. I made a big regular expression to test whether a character was an emoji, then skipped a column after it so the following character wouldn't overlap.

Unicode surrogate pairs

There was one issue though, Unicode surrogate pairs: two UTF-16 code units (what JavaScript counts as two characters) that combine into one glyph. My editor had handled them fine before: as they consist of two characters, they took up two columns, so no space needed to be added.
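
To illustrate what a surrogate pair looks like from JavaScript, here is a small example using the grinning face emoji (U+1F600). The two helper functions are only illustrative names, not part of the editor:

```javascript
// U+1F600 (grinning face) lies outside the Basic Multilingual Plane,
// so JavaScript strings store it as two UTF-16 code units.
const smiley = "\u{1F600}";

console.log(smiley.length);                      // 2
console.log(smiley.charCodeAt(0).toString(16));  // "d83d" (high surrogate)
console.log(smiley.charCodeAt(1).toString(16));  // "de00" (low surrogate)
console.log(smiley.codePointAt(0).toString(16)); // "1f600"
console.log(Array.from(smiley).length);          // 1 (Array.from iterates by code point)

// Hypothetical helpers for spotting the two halves of a pair.
function isHighSurrogate(codeUnit) {
  return codeUnit >= 0xD800 && codeUnit <= 0xDBFF;
}

function isLowSurrogate(codeUnit) {
  return codeUnit >= 0xDC00 && codeUnit <= 0xDFFF;
}
```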

Surrogate modifier pairs

Then I discovered Unicode surrogate modifier pairs: two characters following a surrogate pair, which set the color of the emoji. That's four characters combined into one emoji/glyph.
Now the position calculation became more complicated and I had to refactor a lot of code, such as which column to place the caret on when you click anywhere on the screen, where to render the caret, etc. Almost all features in the editor are individual plugins, and there are at least 10 plugins for rendering things like parenthesis highlighting, the background when you select text, the fade-out effect in the margin, etc.
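
Once glyphs can have different widths, mapping a click to a column is no longer a simple division. One possible approach, sketched here under the assumption that each row keeps an array of glyph widths in pixels (columnAtX and xOfColumn are hypothetical names, and this is not necessarily how the editor does it):

```javascript
// widths[i] is the pixel width of the glyph in column i of one row (assumed data).

// Column to place the caret on for a click at horizontal pixel x.
// The caret snaps to whichever edge of the clicked glyph is closest.
function columnAtX(widths, x) {
  let left = 0;
  for (let col = 0; col < widths.length; col++) {
    const middle = left + widths[col] / 2;
    if (x < middle) return col; // caret goes before this glyph
    left += widths[col];
  }
  return widths.length; // click was past the end of the row
}

// Pixel position at which to render the caret for a given column.
function xOfColumn(widths, col) {
  let x = 0;
  for (let i = 0; i < col; i++) x += widths[i];
  return x;
}

const widths = [8, 8, 16, 8]; // the third glyph is a double-width emoji
console.log(columnAtX(widths, 18)); // 2 (left half of the emoji)
console.log(columnAtX(widths, 30)); // 3 (right half of the emoji)
console.log(xOfColumn(widths, 3));  // 32
```

Snapping to the nearest edge is just one choice; an editor could equally well always place the caret before the clicked glyph.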

Unicode combiner character

When I was almost done with the refactor I discovered the Unicode combiner character (the zero width joiner, U+200D), which lets you combine emojis into one. For example 👨 + 👩 + 👦 = 👨‍👩‍👦 (man + woman + child = family).

Unicode variation characters

Then I discovered the Unicode variation characters (variation selectors), which switch a character between its plain text form and its colored emoji form.
Notice that I wrote above that surrogate modifiers set the color of an emoji. Yep, there are two mechanisms in Unicode that affect how an emoji is colored...
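
Putting the pieces so far together, a simplified way to group a string into clusters that render as one glyph could look like the sketch below. It only knows about surrogate pairs, skin tone modifiers (U+1F3FB to U+1F3FF), the zero width joiner (U+200D) and the variation selectors U+FE0E and U+FE0F. The function names are invented for this example, and the full rules (Unicode's grapheme cluster rules, which JavaScript also exposes through Intl.Segmenter) cover many more cases:

```javascript
const ZWJ = 0x200D; // zero width joiner, glues emojis together

function isSkinToneModifier(cp) {
  return cp >= 0x1F3FB && cp <= 0x1F3FF;
}

function isVariationSelector(cp) {
  return cp === 0xFE0E || cp === 0xFE0F;
}

// Split text into pieces that should each be drawn as a single glyph.
// Simplified: handles only the cases described in this post.
function splitIntoGlyphs(text) {
  const glyphs = [];
  let joinNext = false; // true right after a zero width joiner
  for (const ch of text) { // for...of walks by code point, keeping surrogate pairs whole
    const cp = ch.codePointAt(0);
    const attaches = joinNext || cp === ZWJ ||
      isSkinToneModifier(cp) || isVariationSelector(cp);
    if (attaches && glyphs.length > 0) {
      glyphs[glyphs.length - 1] += ch; // belongs to the previous glyph
    } else {
      glyphs.push(ch);
    }
    joinNext = (cp === ZWJ);
  }
  return glyphs;
}

const baby = "\u{1F476}\u{1F3FD}"; // baby + medium skin tone
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F466}"; // man, ZWJ, woman, ZWJ, boy

console.log(baby.length);                                // 4
console.log(splitIntoGlyphs(baby).length);               // 1
console.log(family.length);                              // 8
console.log(splitIntoGlyphs(family).length);             // 1
console.log(splitIntoGlyphs("a" + baby + "b").length);   // 3
```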

Tab indentation vs columns

While refactoring I thought, why not support tab indentation and tab columns too? Manual indentation is for languages like Python that have significant whitespace. Note that my editor automatically indents and formats JavaScript as you type, so you don't have to use tabs or spaces for indentation in my editor. But there is no way to automatically work out the indentation of Python code, as blocks of code in Python are not ended by curly brackets or "end if"; the code block instead ends where the code is de-indented.

Tab columns are common in old editors and let you neatly format data into columns, which can become tedious if you use spaces. Tab columns automatically calculate and add the space.
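
The simplest form of tab columns is fixed tab stops, where a tab jumps to the next multiple of a tab width. The sketch below assumes exactly that, with a tab width of 8 and one column per character. Whether the editor uses fixed stops or something smarter is not stated here, so tabWidth, nextTabStop and expandTabs are illustrative only:

```javascript
const tabWidth = 8; // assumed distance between tab stops, in columns

// The first tab stop strictly after the given column.
function nextTabStop(column) {
  return column + (tabWidth - (column % tabWidth));
}

// Replace each tab with the number of spaces needed to reach the next tab stop.
// Assumes every other character is exactly one column wide.
function expandTabs(line) {
  let out = "";
  for (const ch of line) {
    if (ch === "\t") {
      out += " ".repeat(nextTabStop(out.length) - out.length);
    } else {
      out += ch;
    }
  }
  return out;
}

console.log(nextTabStop(2)); // 8
console.log(nextTabStop(8)); // 16
console.log(expandTabs("id\tname"));  // "id      name"
console.log(expandTabs("12345\tx"));  // "12345   x"
```

In both example lines the text after the tab starts in column 8, which is what makes the data line up.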

The width of a glyph

So how can you know the width of a character/glyph?
First I assumed that all emojis were two columns wide, and used a regular expression to find out if a character was an emoji. But that didn't work, because not all emojis have the same width! The width of a glyph is defined by the font, and the same character can have different widths in different fonts! So you have to "ask" the font what the glyph width is... for all characters!
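
In a browser, one way to "ask" the font is the canvas measureText method, which returns an object with a width property for the currently set font. The sketch below wraps it in a small cache so each glyph is only measured once. createWidthLookup and glyphWidth are hypothetical names, and the editor may well do this differently:

```javascript
// ctx is assumed to be a CanvasRenderingContext2D whose font is already set.
function createWidthLookup(ctx) {
  const cache = new Map(); // glyph string -> width in pixels
  return function glyphWidth(glyph) {
    let width = cache.get(glyph);
    if (width === undefined) {
      width = ctx.measureText(glyph).width; // ask the font via the canvas
      cache.set(glyph, width);
    }
    return width;
  };
}

// Browser usage (the font string is just an example):
// const ctx = document.createElement("canvas").getContext("2d");
// ctx.font = "16px monospace";
// const glyphWidth = createWidthLookup(ctx);
// const columns = Math.round(glyphWidth("\u{1F600}") / glyphWidth("a"));
```

The cache is only valid for one font, so a new lookup would have to be created whenever the font or font size changes.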

Performance

At first, after adding proper Unicode support, rendering was 10x slower, because every character now has to be checked to see whether it's a surrogate, combiner, or variation character. That said, the most expensive part is still putting the pixels on the screen, which includes calculating sub-pixel precision etc. So despite the extra work needed by the CPU, I managed to make the new code only slightly slower than it was before the Unicode support.

Demo

Copy the text in the textarea below, then paste it into your favorite editor. Then paste it into Webide (my editor) for comparison.

What happens when you place the caret at the right side of the colored baby and press backspace? Is the color removed from the baby emoji, or is the whole emoji removed?
Do you get columns, or does the tab have a static width? Where is the caret placed if you click in the middle of the emoji?


Written by April 17th, 2020.


Follow me via RSS: https://zäta.com/rss_en.xml (copy to feed reader)
or GitHub: https://github.com/Z3TA